Abstract. Inverse reinforcement learning (IRL) addresses the problem
of recovering a task description given a demonstration of the optimal
policy used to solve such a task. The optimal policy is usually provided
by an expert or teacher, making IRL especially suitable for the problem
of apprenticeship learning. The task description is encoded in the form
of a reward function of a Markov decision process (MDP). Several
algorithms have been proposed to find the reward function corresponding to
a set of demonstrations. One of the algorithms that has provided the best
results in different applications is a gradient method to optimize a policy
squared error criterion. In a parallel line of research, other authors have
recently presented a gradient approximation of the maximum likelihood
estimate of the reward signal. In general, both approaches approximate
the gradient estimate and the criteria at different stages to make the
algorithm tractable and efficient. In this work, we provide a detailed
description of the different methods to highlight differences in terms of
reward estimation, policy similarity and computational costs. We also
provide experimental results to evaluate the differences in performance
of the methods.

Keywords: Reinforcement learning, inverse reinforcement learning,
apprenticeship learning, Markov decision processes, gradient methods.



1 Introduction 

As proposed originally by Ng and Russell [19], the objective of the inverse
reinforcement learning (IRL) problem is to determine the reward function that
an expert agent is optimizing in order to solve a task, based on observations
of that expert's behavior while solving the task. The motivation of IRL is
twofold. First, it can provide computational models for human and animal
behavior. It has been demonstrated that human action understanding can be
modeled as inverse planning in a Markov decision process (MDP) model [3,28].
Second, it can be used for the design of intelligent agents, where the
description of the tasks might not be easy to obtain. In the latter case, it
is sometimes simpler to get demonstrations from an agent that already knows
how to do the tasks [31,2,16,?]. For example, sometimes we are unable to
describe a complex body movement, but we can show it so that the learner can
infer how to do it by herself [15]. Furthermore, as discovered in [22] and
exploited in [17], there is a strong connection between the problem of IRL
and that of structured learning [?].

Although the general IRL problem does not assume any model of the
environment, the formal statement of the problem that appears in the
literature assumes an MDP [20] and that the expert follows the principle of
rational action, that is, the expert agent always tries to maximize the
reward function [5]. This formulation is a generalization of the classical
inverse optimal control (IOC) problem in continuous domains [7,11]. However,
it is well known that, even with those restrictions, the problem of inverse
reinforcement learning is ill-posed. That is, there is a virtually infinite
number of rewards that accept the same demonstration as the optimal policy
[18].




Thus, there are two ways of tackling the problem of IRL. On the one hand, in
the seminal paper of Ng and Russell [19] and subsequent works based on it,
such as [21,14], the authors try to characterize the space of solutions of
the reward function. On the other hand, most of the recent algorithms address
the problem of apprenticeship learning, where the learner is less concerned
about the actual reward function, and the objective is to recover a policy
that is close to the demonstrated behavior [2,16,22,31,26,25]. In that way,
apprenticeship learning is related to imitation learning or learning by
demonstration. Therefore, IRL in the case of apprenticeship learning includes
a new restriction based on the similarity or dissimilarity of the expert and
learner behaviors [17]. However, contrary to other approaches in imitation
learning, it can provide a more compact representation in terms of the reward
function, it can generalize to states that have not been demonstrated and it
is more robust to changes in the environment.

This paper studies an algorithm for maximum likelihood IRL from two
different points of view. First, it shows its connections to prior work on
convex optimization for IRL, namely the use of gradient methods in a
least-squares optimization. Second, it provides a comparison of the
performance of maximum likelihood against other IRL methods, as well as of an
approximation of the gradient. The results show that maximum likelihood
always obtains, for the studied problems, the best results. Moreover, the
solutions obtained with the gradient approximation barely degrade while
achieving a considerable speed-up in computation time.

The remainder of the paper is organized as follows. After introducing the
notation and inverse reinforcement learning methods in Section 2, Section 3
describes maximum likelihood IRL and discusses its connections to other
algorithms. Section 4 presents the experimental results. Finally, in
Section 5 we draw the conclusions.

2 Preliminaries 

In this section we will introduce the notation used in this article, point to the 
basic equations we need from direct reinforcement learning, and enunciate the 
inverse reinforcement learning problem. 
2.1 Markov decision processes

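A Markov decision process (MDP) is a tuple $(\mathcal{X}, \mathcal{A}, P, \gamma, R)$ where

- $\mathcal{X}$ is a set of states,

- $\mathcal{A}$ is a set of actions,

- $P(x' \mid x, a) = P_{x'ax}$ is the probability of transitioning to state
  $x' \in \mathcal{X}$ when taking action $a \in \mathcal{A}$ in state
  $x \in \mathcal{X}$, i.e., $P : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to [0,1]$,

- $R$ is a reward function; $R(x, a) = R_{xa}$ returns the reward for taking
  action $a$ in state $x$, i.e., $R : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, and

- $\gamma \in [0, 1)$ is the discount factor.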

The purpose of the MDP is to find the action sequence that maximizes the 
expected future reward: 



$$V(x) = E\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, a_t) \,\middle|\, x_0 = x\right] \tag{1}$$



The sequence of actions is encoded in a policy, which is a mapping
$\pi : \mathcal{X} \times \mathcal{A} \to [0,1]$ such that
$\pi(x, a) = P(a \mid x)$. In the case of a deterministic policy, the
probability distribution collapses to a single action value. We can associate
a value function with a particular policy $\pi$,
$V^\pi(x) = E_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, a_t) \mid x_0 = x\right]$,
where the expectation also considers the stochasticity in the policy. Then,
the optimal policy $\pi^*$ is defined as the policy such that the associated
value function $V^*(x)$ is greater than or equal to the value function of any
other policy, that is, $V^*(x) = \sup_\pi V^\pi(x)$. The optimal value
function satisfies the Bellman equations:

$$V^*(x) = \max_{a \in \mathcal{A}} \left[ R(x, a) + \gamma \sum_{y \in \mathcal{X}} P(y \mid x, a)\, V^*(y) \right] \tag{2}$$



We can also associate an action-value function, or Q-function, with each
policy:

$$Q^\pi(x, a) = E_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, a_t) \,\middle|\, x_0 = x, a_0 = a\right] \tag{3}$$

where $a_t$ is generated by following policy $\pi$ for $t > 0$. The
Q-function can also be updated from the Bellman equation:

$$Q^*(x, a) = R(x, a) + \gamma \sum_{y \in \mathcal{X}} P(y \mid x, a)\, V^*(y) \tag{4}$$

Therefore, the optimal policy can be computed as

$$\pi^*(x) = \arg\max_{a \in \mathcal{A}} Q^*(x, a) \tag{5}$$
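As an illustration of equations (2), (4) and (5), the following minimal sketch runs value iteration on a small tabular MDP. The two-state MDP, the array layout `P[a][x][y]` and the name `value_iteration` are assumptions made for this example; they do not come from the paper.

```python
def value_iteration(P, R, gamma, tol=1e-10):
    """Tabular value iteration.

    P[a][x][y] is the probability of moving from state x to y under action a,
    R[x][a] is the reward for taking action a in state x.
    Returns the optimal values V[x] and action values Q[x][a].
    """
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    while True:
        # Bellman backup for every state-action pair (equation 4).
        Q = [[R[x][a] + gamma * sum(P[a][x][y] * V[y] for y in range(n_states))
              for a in range(n_actions)]
             for x in range(n_states)]
        # Maximization over actions (equation 2).
        V_new = [max(q_x) for q_x in Q]
        if max(abs(v - w) for v, w in zip(V, V_new)) < tol:
            return V_new, Q
        V = V_new


# Hypothetical 2-state MDP: action 0 leads to state 0, action 1 leads to
# state 1, and only state 1 is rewarded.
P = [
    [[1.0, 0.0], [1.0, 0.0]],  # action 0
    [[0.0, 1.0], [0.0, 1.0]],  # action 1
]
R = [[0.0, 0.0], [1.0, 1.0]]
V, Q = value_iteration(P, R, gamma=0.9)

# Greedy policy (equation 5): pick the action with the largest Q-value.
policy = [max(range(len(q_x)), key=q_x.__getitem__) for q_x in Q]
print(V, policy)
```

In this toy problem the values converge to approximately 9 and 10, and the greedy policy always moves to the rewarded state.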
2.2 Differentiable Markov decision processes

One of the limitations of the Bellman formulation of MDPs is the
non-differentiability of the maximum function. Thus, many algorithms replace
the maximum function by a softmax function [30]. For example, we can replace
equation (5) by its softmax version, the Boltzmann policy:

$$\pi^*(a \mid x) = \frac{\exp\left(Q^*(x, a)/\eta\right)}{\sum_{b \in \mathcal{A}} \exp\left(Q^*(x, b)/\eta\right)} \tag{6}$$

where $\eta$ is the Boltzmann temperature. If the temperature
$\eta \to 0^+$, then equation (6) becomes equivalent to equation (5). If
$\eta \to \infty$, then $\pi^*$ becomes a uniform random walk.
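The two temperature limits can be checked numerically. The sketch below implements a Boltzmann (softmax) policy over the Q-values of a single state; the function name `boltzmann_policy` and the example Q-values are hypothetical, and the maximum is subtracted before exponentiating only for numerical stability.

```python
import math


def boltzmann_policy(q_values, eta):
    """Return the softmax distribution over actions for one state.

    q_values: list of Q-values, one per action; eta: temperature (> 0).
    """
    q_max = max(q_values)  # subtracting the max leaves the result unchanged
    weights = [math.exp((q - q_max) / eta) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q = [1.0, 2.0]                      # hypothetical Q-values for two actions
cold = boltzmann_policy(q, 0.01)    # low temperature: nearly greedy
hot = boltzmann_policy(q, 1e6)      # high temperature: nearly uniform
print(cold, hot)
```

With a low temperature almost all the probability mass falls on the action with the largest Q-value; with a high temperature the two actions become nearly equiprobable.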



2.3 Inverse reinforcement learning 

As stated before, the inverse reinforcement learning (IRL) problem consists
of learning the reward function of a reward-less MDP, that is,
MDP$\setminus R$, given a set of trajectories from an expert or teacher.
Formally, in the IRL problem we are given a dataset

$$\mathcal{D} = \{(x_i, a_i)\}_{i=1}^{M} \tag{7}$$

containing observations of an expert agent acting in the MDP
$\mathcal{M} = (\mathcal{X}, \mathcal{A}, P, R, \gamma)$. The dataset can
contain full or partial trajectories of the expert, or even some sparse
action selections.

We assume the expert acts near-optimally trying to solve a certain task 
encoded by the reward function R. Both the task and the reward signal are 
unknown to the learning agent. Her goal is to find a reward function that explains 
the observed behavior of the expert. 

The problem of IRL is, by definition, ill-posed [?] because different rewards
can produce the same behavior [18]; that is, a demonstration can neither
determine a single reward signal nor discriminate among an infinite set of
reward functions. Furthermore, the set of solutions for a demonstration
includes degenerate cases such as a flat reward for every state-action pair
[19]. Also, there can be some rewards that do not depend directly on the
state of the system, but on some intrinsic parameters of the agent [53]; or
the agent might not fully observe the system, in which case the direct
problem becomes a partially observable MDP (POMDP) [515].

At this point it is important to note that, apart from the sums in the value
function, there is no other assumption in this paper about whether the spaces
are discrete or continuous. In fact, many algorithms that were designed for
discrete spaces have been applied in continuous setups, provided that there
is a planning algorithm to solve the direct problem, just by replacing the
sums by the corresponding integrals [?,30].
Inverse reinforcement learning as convex optimization. As presented in [17],
many algorithms for apprenticeship learning based on IRL share a common
formulation [2,16,22,25,5]. Basically, the objective is to find the reward
that maximizes the similarity between the expert's and learner's behavior:

$$R^* = \arg\max_{R} J(\pi_R, \mathcal{D}) \tag{8}$$

Those algorithms also include some extra assumptions to constrain the
admissible reward set in order to reduce the effects of the estimation being
ill-posed. The most common assumption is to consider that the reward function
is a linear combination of basis functions $\phi(x, a)$, also called state
features:

$$R_\theta(x, a) = \sum_{i=1}^{N} \theta_i \phi_i(x, a) \tag{9}$$

where $\theta$ is a vector of feature weights. Then equation (1) can be
rewritten in matrix form as $V(x) = \theta^\top \Phi^\pi(x, a)$, where

$$\Phi^\pi_i(x, a) = E\left[\sum_{t=0}^{\infty} \gamma^t \phi_i(x_t, a_t) \,\middle|\, x_0 = x\right] \tag{10}$$

The selection of the features might be tricky depending on the problem. In
some works, the IRL problem is formulated by directly trying to find an
arbitrary $R(x, a)$. This is equivalent to using the indicator function as
feature, $\phi_i(x, a) = I_i(x, a)$. Some authors have also tried to learn
the features as a part of the IRL problem [?].

As pointed out by [?], this estimation resembles the problem of structured
learning [?]. The main hypothesis of [27] is that, for a certain kind of
combinatorial problems in the form of equation (9), computing the likelihood
function is intractable. Therefore, they propose the max-margin method, which
approximates the correct solution with polynomial complexity. In fact, the
work of [17] shows that, under certain conditions, the problems of structured
learning and IRL are equivalent. In this paper, we show that for IRL problems
we can obtain an alternative good approximation of the likelihood function
with better performance than the heuristic proposed for structured learning.



Inverse reinforcement learning as probabilistic inference. The problem of
IRL can also be solved using Bayesian inference. Ramachandran and Amir [21]
presented an algorithm called Bayesian inverse reinforcement learning, where
they treat the unknown reward function as a stochastic variable which can be
inferred from the observations in the demonstration,
$P(R \mid \mathcal{D}) \propto P(\mathcal{D} \mid R) P(R)$. Then, they
introduce the following likelihood model

$$P(\mathcal{D} \mid R) = \prod_{i} P(x_i, a_i \mid R) \propto e^{\alpha \sum_i Q^*(x_i, a_i)} \tag{11}$$
where $\alpha$ is a parameter of the distribution that represents the
confidence in the expert. The likelihood of each pair $(x, a)$ is equivalent
to the Boltzmann policy from equation (6), assuming that the confidence is
the inverse of the Boltzmann temperature, $\alpha = 1/\eta$. Although the
computation of a full distribution of rewards might seem excessive for
apprenticeship learning, it has some advantages, such as being able to use
the uncertainty in the estimation in an active learning framework [13].



Inverse reinforcement learning as density estimation. Although they are not
within the scope of the paper, it is worth mentioning a set of algorithms
rooted in the computation of the KL-divergence (or relative entropy) between
the optimal policy and the passive dynamics of the system [6,?,31]. In
contrast with the two approaches described above, this family of algorithms
does not need to solve the direct planning problem. Instead, they require
computing the distributions of trajectories following the passive dynamics of
the system and the potentially optimal dynamics of the expert. Furthermore,
it is not clear whether those algorithms are comparable to other IRL
algorithms, since the problems they solve are different. For example, [17]
shows that the dissimilarity function that those algorithms optimize does not
use optimal policies. Besides, [?] shows that those methods are within the
framework of linearly-solvable MDPs.



3 Maximum likelihood IRL 

In this section we describe in detail the algorithm introduced in [13] to
solve the IRL problem using an estimate of the gradient of the likelihood
function from equation (11). Then, we present new connections between that
algorithm and other algorithms in the literature. First, to provide a uniform
framework, we will use the reward parametrization of equation (9). As
commented before, the original formulation of [13] can be recovered by taking
$\phi_i(x, a)$ to be the indicator function.

For simplicity, we assume that the feature functions are known in advance.
Therefore, the reward function is fully determined by the vector of weights
$\theta$. We define the likelihood of the data set as the product of the
likelihoods of the state-action pairs
$P(x_i, a_i \mid R_\theta) = \ell_\theta(x_i, a_i)$, defined as in
equation (11),

$$\mathcal{L}_\theta(\mathcal{D}) = \prod_{i=1}^{M} \ell_\theta(x_i, a_i) \tag{12}$$

where $M$ is the number of demonstrated pairs.

Thus, a gradient ascent algorithm can be used to estimate the reward function
$R$ maximizing the log-likelihood function with respect to the demonstrations
$\mathcal{D}$. As described above, we are interested in computing a reward
function $R^*_\theta$ such that:

$$R^*_\theta = \arg\max_{R_\theta} \log \mathcal{L}_\theta(\mathcal{D}) \tag{13}$$
subject to the constraints on the parameter weights $\theta_i >$ … and …
$1$, where

$$\log \mathcal{L}_\theta(\mathcal{D}) = \sum_{i=1}^{M} \log \ell_\theta(x_i, a_i) \tag{14}$$



The model in Eq. (11) assumes that the probability of the expert choosing an
action grows exponentially with the action's value; i.e., actions with higher
$Q^*$ are more likely to be selected. Given an observed pair $(x, a)$, the
likelihood of the pair under the reward function $R_\theta$ is defined as in
equation (11), which is equivalent to the Boltzmann policy from equation (6).
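A minimal sketch of the log-likelihood being maximized is given below. It assumes that the pair likelihood is a Boltzmann policy over a Q-table `Q[x][a]` that has already been computed for the current reward; the function name `log_likelihood`, the Q-values and the demonstration pairs are hypothetical.

```python
import math


def log_likelihood(Q, demos, eta=1.0):
    """Sum over demonstrated pairs (x, a) of log softmax(Q[x] / eta)[a]."""
    total = 0.0
    for x, a in demos:
        q_max = max(Q[x])
        # log of the normalizing constant, computed stably (log-sum-exp).
        log_z = q_max / eta + math.log(
            sum(math.exp((q - q_max) / eta) for q in Q[x]))
        total += Q[x][a] / eta - log_z
    return total


# Hypothetical Q-table for 2 states and 2 actions, and three demonstrations.
Q = [[8.1, 9.0], [9.1, 10.0]]
demos = [(0, 1), (1, 1), (0, 1)]
print(log_likelihood(Q, demos, eta=0.5))
```

Demonstrations that agree with the higher-valued actions yield a larger log-likelihood than demonstrations that contradict them, which is what a gradient ascent on the reward weights exploits.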



3.1 Computing the gradient 

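At this point, we need to determine the gradient

$$\nabla_\theta \log \mathcal{L}_\theta(\mathcal{D}) = \frac{\partial}{\partial \theta} \left[ \sum_{i=1}^{M} \log \ell_\theta(x_i, a_i) \right] = \sum_{i=1}^{M} \frac{1}{\ell_\theta(x_i, a_i)} \frac{\partial \ell_\theta(x_i, a_i)}{\partial \theta} \tag{15}$$

The partial derivative of the pair likelihood can be calculated as:

$$\frac{\partial \ell_\theta(x, a)}{\partial \theta} = \frac{\ell_\theta(x, a)}{\eta} \left[ \frac{\partial Q^*}{\partial \theta}(x, a) - \sum_{b \in \mathcal{A}} \ell_\theta(x, b) \frac{\partial Q^*}{\partial \theta}(x, b) \right] \tag{16}$$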



The likelihood of the pair $(x_i, a_i)$ can be easily calculated by solving
the direct RL problem. However, we still need to find the gradient of the
optimal Q-function with respect to the reward parameters $\theta$. This
derivative is not trivial, since $Q^*$ depends on $R_\theta$ in two ways:
first through the accumulated reward and second through the optimal policy.
However, we can find an estimate of the derivatives of the $Q^*$ functions.
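For completeness, the derivative of the pair likelihood can be worked out in three steps. The sketch below assumes that the pair likelihood is the Boltzmann policy with temperature $\eta$, that is, $\ell_\theta(x,a) = e^{Q^*(x,a)/\eta} / \sum_b e^{Q^*(x,b)/\eta}$, and that $Q^*$ is differentiable at the current $\theta$.

```latex
\begin{align*}
\log \ell_\theta(x,a)
  &= \frac{Q^*(x,a)}{\eta} - \log \sum_{b \in \mathcal{A}} e^{Q^*(x,b)/\eta} \\
\frac{\partial \log \ell_\theta(x,a)}{\partial \theta}
  &= \frac{1}{\eta} \frac{\partial Q^*}{\partial \theta}(x,a)
     - \frac{\sum_{b \in \mathcal{A}} e^{Q^*(x,b)/\eta} \,
             \frac{1}{\eta} \frac{\partial Q^*}{\partial \theta}(x,b)}
            {\sum_{c \in \mathcal{A}} e^{Q^*(x,c)/\eta}}
   = \frac{1}{\eta} \left[ \frac{\partial Q^*}{\partial \theta}(x,a)
     - \sum_{b \in \mathcal{A}} \ell_\theta(x,b)
       \frac{\partial Q^*}{\partial \theta}(x,b) \right] \\
\frac{\partial \ell_\theta(x,a)}{\partial \theta}
  &= \ell_\theta(x,a) \, \frac{\partial \log \ell_\theta(x,a)}{\partial \theta}
\end{align*}
```

The last line uses the identity $\partial \ell = \ell \, \partial \log \ell$, so the only quantity that remains to be estimated is the derivative of $Q^*$ with respect to $\theta$.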



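Fixed-point estimate. As shown by [16], the derivatives of the $Q^*$
functions can be computed almost everywhere with a fixed-point equation:

$$\psi_\theta(x, a) = R'_\theta(x, a) + \gamma \sum_{y \in \mathcal{X}} P(y \mid x, a) \sum_{b \in \mathcal{A}} \pi(b \mid y)\, \psi_\theta(y, b) \tag{17}$$

where $R'_\theta(x, a)$ is the derivative of the reward w.r.t. $\theta$ and
$\pi$ is any policy that is greedy with respect to $Q_\theta$. If the reward
is in the linear form of (9), then $R'_\theta(x, a) = \phi(x, a)$ and the
solution to that equation is also the set of feature expectations
$\Phi^\pi(x, a)$ defined in equation (10). Thus, the resulting equation for
the derivative of $\ell_\theta$ is

$$\frac{\partial \ell_\theta(x, a)}{\partial \theta} = \frac{\ell_\theta(x, a)}{\eta} \left[ \Phi^\pi(x, a) - \sum_{b \in \mathcal{A}} \ell_\theta(x, b)\, \Phi^\pi(x, b) \right] \tag{18}$$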
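The fixed-point recursion can be sketched for a single feature as follows. The code iterates the recursion a fixed number of times for a given policy; the function name `q_derivative_fixed_point`, the layout `P[a][x][y]` and the two-state example are assumptions made for illustration only.

```python
def q_derivative_fixed_point(P, phi, pi, gamma, n_iter=500):
    """Iterate psi(x,a) = phi(x,a) + gamma * sum_y P(y|x,a) sum_b pi(b|y) psi(y,b).

    P[a][x][y]: transition probabilities, phi[x][a]: one feature,
    pi[y][b]: probability of action b in state y under the greedy policy.
    """
    n_actions, n_states = len(P), len(phi)
    psi = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iter):
        # Expected derivative at each next state under the policy.
        v = [sum(pi[y][b] * psi[y][b] for b in range(n_actions))
             for y in range(n_states)]
        psi = [[phi[x][a] + gamma * sum(P[a][x][y] * v[y] for y in range(n_states))
                for a in range(n_actions)]
               for x in range(n_states)]
    return psi


# Hypothetical MDP: action 0 leads to state 0, action 1 leads to state 1.
P = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
phi = [[0.0, 0.0], [1.0, 1.0]]   # indicator feature of state 1
pi = [[0.0, 1.0], [0.0, 1.0]]    # greedy policy: always take action 1
psi = q_derivative_fixed_point(P, phi, pi, gamma=0.9)
print(psi)
```

Because the reward in this example is exactly the feature with unit weight, the resulting derivative coincides with the discounted feature expectations of each state-action pair.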
Independence assumption. The second estimate is an approximation based on
the assumption that the policy remains unchanged under a small variation in
the reward function, as in [13]. The Bellman recursions of the value function
and Q-function from equations (2) and (4) using vector notation are:

$$V^* = \max_{a \in \mathcal{A}} \left[ R_a + \gamma P_a V^* \right], \qquad V^\pi = R_\pi + \gamma P_\pi V^\pi$$

$$Q^*_a = R_a + \gamma P_a V^*, \qquad Q^\pi_a = R_a + \gamma P_a V^\pi$$

Then, we notice that for any optimal policy $\pi^*$:

$$V^* = (I - \gamma P_{\pi^*})^{-1} R_{\pi^*} \tag{19}$$

These expressions can be combined into

$$Q^*_a = R_a + \gamma P_a (I - \gamma P_{\pi^*})^{-1} R_{\pi^*} \tag{20}$$

Let us define $T = I - \gamma P_{\pi^*}$. Ignoring the dependency on the
policy of the right-hand side of the equation, one obtains the following
approximation of the derivative:

$$\frac{\partial Q^*_a}{\partial \theta_k} \approx \phi_k(x, a) + \gamma P_a T^{-1} \left[ \sum_{b \in \mathcal{A}} \pi^*(x, b)\, \phi_k(x, b) \right] \tag{21}$$

In contrast with the fixed point method of [16], this approximation has a
computational cost that is polynomial (due to the inverse) in the number of
states. In the experiments we show that both methods provide comparable
results in terms of accuracy.
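A sketch of the approximation in equation (21) for a single feature is shown below. Instead of forming the inverse explicitly, it solves the linear system with a small Gauss-Jordan routine; the helper `solve`, the function name `q_derivative_independence`, the layout `P[a][x][y]` and the toy MDP are assumptions for illustration.

```python
def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [m_r - f * m_c for m_r, m_c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def q_derivative_independence(P, phi, pi, gamma):
    """Approximate dQ*/dtheta_k, keeping the policy pi fixed.

    P[a][x][y]: transitions, phi[x][a]: feature k, pi[x][b]: policy.
    """
    n_actions, n_states = len(P), len(phi)
    # T = I - gamma * P_pi, with P_pi the transition matrix under pi.
    T = [[(1.0 if x == y else 0.0)
          - gamma * sum(pi[x][b] * P[b][x][y] for b in range(n_actions))
          for y in range(n_states)]
         for x in range(n_states)]
    # Feature averaged over the policy in each state.
    phi_pi = [sum(pi[x][b] * phi[x][b] for b in range(n_actions))
              for x in range(n_states)]
    v = solve(T, phi_pi)  # v = T^{-1} phi_pi
    return [[phi[x][a] + gamma * sum(P[a][x][y] * v[y] for y in range(n_states))
             for a in range(n_actions)]
            for x in range(n_states)]


# Hypothetical MDP: action 0 leads to state 0, action 1 leads to state 1.
P = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
phi = [[0.0, 0.0], [1.0, 1.0]]   # indicator feature of state 1
pi = [[0.0, 1.0], [0.0, 1.0]]    # optimal policy: always take action 1
dQ = q_derivative_independence(P, phi, pi, gamma=0.9)
print(dQ)
```

A single linear solve replaces the repeated sweeps over the state space, which is where the saving in computation time comes from when the number of states is moderate.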

3.2 Comparison to other IRL algorithms 

Although we have presented the maximum likelihood algorithm in terms of
probabilistic inference, it can be easily reformulated in terms of convex
optimization, in order to be comparable to other IRL algorithms as in [?].
For example, the update rule in (15) can be rewritten by replacing the sum
over the dataset by a sum over all state-action pairs:

$$\frac{1}{M} \nabla_\theta \log \mathcal{L}_\theta(\mathcal{D}) = \sum_{(x,a) \in \mathcal{X} \times \mathcal{A}} \mu_E(x)\, \pi_E(a \mid x)\, \frac{1}{\ell_\theta(x, a)} \frac{\partial \ell_\theta(x, a)}{\partial \theta} \tag{22}$$

where $\mu_E(x)$ is the observed state visitation frequency of the expert's
behavior,

$$\mu_E(x) = \frac{1}{M} \sum_{i=1}^{M} \sum_{a \in \mathcal{A}} I_{xa}(x_i, a_i) \tag{23}$$

Remember that the indicator function is defined as $I_{xa}(y, b) = 1$ if
$x = y \wedge a = b$ and $0$ otherwise.



On the Performance of Maximum Likelihood Inverse Reinforcement Learning 



Here, $\pi_E(a \mid x)$ is the policy estimated from observations of the
expert's behavior.

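As shown in [5], the estimated expert policy can be inaccurate when there are
many unvisited states or the demonstrations are scarce. When no observations
are available for a state we will assume that the optimal policy is a random
walk.

In terms of maximization, the constant $M$ can be dropped and we can replace
the likelihood function by the Boltzmann policy since, by definition,
$\ell_\theta(x, a) = \pi_\theta(a \mid x)$. Therefore, we obtain:

$$\nabla_\theta \log \mathcal{L}_\theta(\mathcal{D}) \propto \sum_{(x,a) \in \mathcal{X} \times \mathcal{A}} \mu_E(x)\, \pi_E(a \mid x)\, \frac{1}{\pi_\theta(a \mid x)} \frac{\partial \pi_\theta(a \mid x)}{\partial \theta} \tag{25}$$

which can be integrated to obtain the similarity function being maximized:

$$J(\pi_\theta, \mathcal{D}) = \sum_{(x,a) \in \mathcal{X} \times \mathcal{A}} \mu_E(x)\, \pi_E(a \mid x) \log \pi_\theta(a \mid x) \tag{26}$$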

x,a£XxA 

The maximum likelihood approach, therefore, describes an alternative cost
function for IRL problems. Indeed, following [17], it belongs to the family
of algorithms aiming to match the policies instead of the feature
expectations, like the policy matching algorithm described in [16]. However,
the latter uses a least-squares cost function instead of a maximum likelihood
approach.
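The relation between the similarity function of equation (26) and the log-likelihood can be checked on a small example. The sketch below estimates the state visitation frequencies and the expert policy from counts, which is an assumption about how those empirical quantities are obtained; the name `similarity`, the demonstrations and the learner policy `pi_theta` are hypothetical.

```python
import math
from collections import Counter


def similarity(demos, pi_theta):
    """J = sum_{x,a} mu_E(x) * pi_E(a|x) * log pi_theta(a|x), from counts."""
    M = len(demos)
    pair_counts = Counter(demos)
    state_counts = Counter(x for x, _ in demos)
    J = 0.0
    for (x, a), n_xa in pair_counts.items():
        mu_e = state_counts[x] / M       # empirical visitation frequency
        pi_e = n_xa / state_counts[x]    # empirical expert policy
        J += mu_e * pi_e * math.log(pi_theta[x][a])
    return J


demos = [(0, 1), (0, 1), (0, 0), (1, 1)]
pi_theta = [[0.2, 0.8], [0.5, 0.5]]      # learner policy pi_theta[x][a]
J = similarity(demos, pi_theta)

# The same quantity computed as the average log-likelihood of the pairs.
avg_log_lik = sum(math.log(pi_theta[x][a]) for x, a in demos) / len(demos)
print(J, avg_log_lik)
```

Both numbers agree, which illustrates that maximizing the similarity function is the same as maximizing the log-likelihood up to the constant factor $M$. Unvisited state-action pairs contribute nothing to the sum.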



4 Experiments 

In this section we evaluate the performance of maximum likelihood IRL, which
we call GIRL for Gradient-based IRL, and compare it with other IRL
algorithms: Policy Matching (PM) [16] and the Multiplicative Weights
Algorithm (MWAL) [25]. The experiments are divided into two different
scenarios. First, we use a standard grid world as used in many IRL papers
since Abbeel and Ng [?]. Then, we use the sailing simulator proposed by
Vanderbei [29] and used for IRL by Neu and Szepesvári [16].

In addition to comparing the methods, we also evaluate the impact of
approximating the derivative using Eq. (21). Since for the reward model used
in the paper the derivative is equal to the feature expectations, it is
possible to compute them using the same approximation for the MWAL algorithm.
In order to have a fair comparison in terms of computational time, we also
report results where the feature expectations (i.e. the derivative) are
estimated in a single step (horizon one). We will denote the full fixed point
recursion as FP, the independence assumption of Eq. (21) as IA and the
one-step fixed point as FP1.

First, we describe the metrics that we use to evaluate and compare the
algorithms. Then, we describe the setup used for the experiments and show
some results. Finally, we provide some quantitative results about the
different criteria to approximate the derivative of the Q-function.
4.1 Performance metrics

We are interested in comparing the ability of the algorithms to a) recover
the structure of the problem and b) propose policies that perform as well as
the demonstrated expert.

One measure of performance is the accumulated reward computed using policy
evaluation [53]. The value function following policy $\pi$ from a state $x$
with a reward function $R$ is

$$V^\pi_R(x) = E_\pi\left\{ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid x_t = x \right\} \tag{27}$$

The total value of a policy $\pi$ is:

$$V^\pi_R = \sum_{x \in \mathcal{X}} V^\pi_R(x)\, P(x_0 = x) \tag{28}$$

We call $R_E$ the real reward which the expert is optimizing, and $\pi_E$ the
policy of the expert. The IRL algorithm converges to a reward function
$R_{IRL}$ with corresponding optimal policy $\pi_{IRL}$. Then, the maximum
obtainable accumulated reward is $V^{\pi_E}_{R_E}$ and the accumulated reward
obtained by the IRL solution is $V^{\pi_{IRL}}_{R_E}$.

Another measure of how well the algorithm solves the problem is the
comparison of the greedy policies of the expert and the learner. We measure
the fraction of states in which the greedy version of the learned policy
$\pi_{IRL}$ matches the actual optimal greedy policy $\pi_E$. However, this
measure can sometimes be misleading, since taking the wrong actions in
critical parts of the problem can be disastrous reward-wise. Also, in some
parts of the problems there is more than one optimal action. This has been
called the label bias problem [31], and the desired performance, whether to
match the optimal policy or the distribution over paths, depends on the task.
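The two metrics can be sketched on a toy problem as follows. The code evaluates deterministic policies under the true reward by iterative policy evaluation and counts the states where two greedy policies agree; the function names, the initial state distribution `p0` and the two-state MDP are hypothetical.

```python
def policy_value(P, R, policy, gamma, p0, n_iter=1000):
    """Accumulated reward of a deterministic policy, weighted by p0.

    P[a][x][y]: transitions, R[x][a]: true reward, policy[x]: chosen action,
    p0[x]: probability of starting in state x.
    """
    n_states = len(R)
    V = [0.0] * n_states
    for _ in range(n_iter):
        V = [R[x][policy[x]]
             + gamma * sum(P[policy[x]][x][y] * V[y] for y in range(n_states))
             for x in range(n_states)]
    return sum(p * v for p, v in zip(p0, V))


def policy_similarity(policy_a, policy_b):
    """Fraction of states where both policies choose the same action."""
    return sum(a == b for a, b in zip(policy_a, policy_b)) / len(policy_a)


# Hypothetical MDP: action 0 leads to state 0, action 1 leads to state 1,
# and only state 1 is rewarded.
P = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
R = [[0.0, 0.0], [1.0, 1.0]]
p0 = [1.0, 0.0]                  # always start in state 0
expert = [1, 1]
learner = [0, 1]                 # wrong action only in the start state

v_expert = policy_value(P, R, expert, 0.9, p0)
v_learner = policy_value(P, R, learner, 0.9, p0)
match = policy_similarity(expert, learner)
print(v_expert, v_learner, match)
```

In this example the learner agrees with the expert in half of the states but collects no reward at all, because its single mistake keeps it away from the rewarded state. This is the kind of situation in which policy similarity alone is misleading.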



Fig. 1. Narrow passage problem: (a) the true reward, (b) the expert policy.
Paths problem: (c) the true reward, (d) the expert policy.



4.2 Grid world 

The grid world is made up of grid squares with five actions available to the
agent, corresponding to moves in the compass directions (N, S, E, W) plus an
action for staying in the current state. We assume that the real reward is in
the linear form of equation (9). The grid is divided into macro-cells which
can span several grid squares. The reward features are the indicator
functions of the agent's state being inside a macro-cell, $I_i(x)$. Note that
in this example, the real and estimated rewards depend only on the state,
$R(x, a) = R(x)$.

Fig. 2. Narrow passage. Average results over 10 runs. Top: policy similarity
for (a) FP, (b) IA and (c) FP1. Bottom: accumulated reward for (d) FP, (e) IA
and (f) FP1.

After doing some preliminary tests, we found that it is important for the grid 
world example to have some structure, in order to be able to draw conclusions 
about IRL algorithms. In many papers, the reward is chosen randomly resulting 
in problems with little structure. In those cases, the IRL algorithm needs only 
to find the goal point to perform well thus not helping to properly evaluate 
the algorithms. Based on this observation, we propose two different grid world 
problems with structure to assess performance:



- A narrow passage problem, see Figure 1 (a) and (b), where one corner has a
  slightly better reward than the other states, but to get to the corner the
  agent has to walk along the narrow passage, being careful not to fall into
  the pit. This problem has 25 macro-cells arranged in a 5x5 square grid. The
  true reward accumulated by the expert is $V^{\pi_E}_{R_E} = 0.9964$.

- A path following problem, see Figure 1 (c) and (d), where most states give
  no reward, except the goal state and the states along some paths to this
  goal. The agent has to follow these paths and beware of stepping outside on
  the way to the goal. This problem has 100 macro-cells arranged in a 10x10
  square grid. The true reward accumulated by the expert is
  $V^{\pi_E}_{R_E} = 0.9358$.
Fig. 3. Path problem. Average results over 10 runs. Top: policy similarity
for (a) FP, (b) IA and (c) FP1. Bottom: accumulated reward for (d) FP, (e) IA
and (f) FP1.



Figure 2 shows the results for the narrow passage problem with macro-cells of
size 2x2 states. (Very similar results were obtained with bigger macro-cells,
which are not reported due to lack of space. For instance, for 4x4
macro-cells, GIRL accumulated 0.98122.) The results for the full derivative
show that the three methods perform quite well, obtaining an accumulated
reward over 0.95 (compared to 0.99 for the expert) (Fig. 2(d)), with GIRL
achieving the best results (0.987), almost identical to the expert. Similar
results are obtained for the final estimated policy, where GIRL behaves
better than the other methods. The results barely change when using the
approximate derivative (see Figs. 2(b) and (e)). Again, GIRL obtains slightly
better results than PM and MWAL. In both cases, this is mainly due to some
solutions getting stuck at local minima, which make the agent fall into the
pit from some states or fail to get out of the pit through the fastest route.
Indeed, this is illustrated by the variance bars, which are smaller for the
GIRL method than for the others. For the FP1 approximation, GIRL behaves
almost identically. The performance of PM degrades a little and it gets stuck
in local minima more often. MWAL, on the other hand, is not able to recover
any sensible policy. This is expected, since MWAL tries to match the feature
expectations and, therefore, is more sensitive to approximations. However, it
is surprising that it does work well with the IA approximation.

We now analyze the results for the path following problem (Fig. 3). Roughly,
the results are the same as in the narrow passage. All the methods behave
similarly for the FP and the IA cases, with an accumulated reward almost
identical to the expert's (0.93). (The accumulated reward with 4x4
macro-cells was also best for GIRL (0.82).) In fact, the differences between
them are smaller than in the narrow passage. The main reason is that PM and
MWAL do not seem to converge to worse local minima so often. This is
indicated by a smaller variance in the figures, although it is still bigger
than that of GIRL. As in the previous case, PM and GIRL almost keep the same
performance with the FP1 approximation of the derivative, and the MWAL
algorithm fails to find a good solution.

The computational cost varies enormously depending on the method selected to
compute the derivative. For example, in the case of GIRL for the path
following problem, FP took 2.98 s per iteration on average, while IA and FP1
took 1.14 s and 0.48 s respectively. The results are consistent with the fact
that FP has an exponential cost, while IA has a polynomial cost. On the other
hand, the differences between IRL algorithms were negligible; the number of
iterations until convergence was more important.

4.3 Sailing 

In the problem of "sailing", proposed by Vanderbei [29], the task is to
navigate a sailboat from one starting point to a goal point in the shortest
time possible. The speed of the sailboat depends on the relative direction
between the wind and the sailboat, and the wind follows a stochastic process.
In the MDP context, the state contains the data about the sailboat's position
in the grid, the current wind direction, and the current tack. The possible
actions are to move to any of the 8 adjacent cells in the grid. The rewards
are negative, and correspond to the time required to carry out the movement
under the current wind, i.e. it is faster to sail away from the wind than
into the wind, and changing the tack induces an additional delay. The reward
function is a linear combination of six features (away, down, cross, up,
into, delay); see [29] for details. These features depend on the state and
the action. Thus, the problem is more challenging than the grid worlds, which
depend only on the state, that is, the location on the grid. The true weights
we used in the experiments are the same ones used by Vanderbei [29], namely
$\theta^* = (-1, -2, -3, -4, -100000, -3)^\top$.

Table 1 summarizes the results of solving the sailing problem with different
demonstrations, averaged over 10 runs. The demonstrations consisted of 5120
expert trajectories each. We ran the algorithms for one hundred iterations,
which was more than needed for convergence of the dissimilarity function
$J(\pi_\theta, \mathcal{D})$ (around 40 IRL iterations). Figure 4 shows the
evolution of the two metrics: the real accumulated reward of the learned
policy, $V^{\pi_{IRL}}_{R_E}$, and the proportion of states $x_i$ where
$\pi_E(x_i) = \pi_\theta(x_i)$. For this problem, the reward accumulated by
the expert is $V^{\pi_E}_{R_E} = -11.76$.

The GIRL and the PM algorithms perform reasonably well for the FP, IA and FP1
cases. GIRL had the best performance, with an accumulated reward almost
identical to the expert's independently of the derivative used. PM shows
slight differences that could be due to sampling noise in the demonstrations.
In general, it had a lower accumulated reward (between -2000 and -400) and
also a higher variance than GIRL. Surprisingly, MWAL worked best in the IA
case, where it obtained an accumulated reward of -395. However, it performed
poorly in the other two cases, where it did not converge to the right
solution. This effect, which also appeared in the grid world, requires
further investigation. In any case, the PM and MWAL solutions revealed that
most of the loss in reward is due to a small number of wrong actions (going
against the wind), which increased the final time. This was avoided by the
GIRL method. Finally, the computational times depend mainly on the
approximation used (see Table 1) and, as in the grid world problems, they are
very similar among methods.

Method      Acc. reward   Policy similarity (%)    T (s)
GIRL-FP          -11.76                   93.87   332.44
MWAL-FP       -15440.31                   83.87   522.59
PM-FP          -1630.48                   90.90   376.26
GIRL-I           -11.76                   93.91    52.73
MWAL-I          -395.45                   85.43    47.95
PM-I           -2035.15                   91.05    47.56
GIRL-FP1         -11.76                   94.53    34.53
MWAL-FP1      -16814.30                   86.29    35.06
PM-FP1          -416.44                   93.91    37.16

Table 1. Experimental results for the sailing problem for 100 iterations of
the GIRL, PM and MWAL methods using the fixed point method (FP), the
independence assumption (IA) and a single step of the fixed point recursion
(FP1). The reward accumulated by the expert is $V^{\pi_E}_{R_E} = -11.76$.

Fig. 4. Evolution of the performance of the different IRL methods in the
sailing problem: policy similarity and accumulated reward for (a,d) FP,
(b,e) IA and (c,f) FP1.
5 Conclusions

In this paper we have reviewed gradient-based algorithms for the problems of
inverse reinforcement learning and apprenticeship learning. On the one hand,
we have discussed in detail the properties of a recently developed algorithm,
maximum likelihood IRL. We have shown that the probabilistic inference
approach has connections with IRL as convex optimization. In particular, it
represents an alternative cost function to the least-squares criterion of the
policy matching algorithm. The experimental results show that for some
typical problems the behavior of the likelihood-based algorithm is at least
as good as that of the other methods. On the other hand, one of the most
expensive steps in gradient-based methods is the computation of the
derivative of the policy. We have analyzed an approximation of the derivative
that assumes that small changes in the reward function do not affect the
policy. The approximated derivative can be computed, at every iteration, in
polynomial time (instead of using a fixed point recursion). Results show that
the approximation is much faster to compute and that the obtained reward is
as accurate as the one obtained with the full derivative.